Goto

Collaborating Authors

 ad generation


MCAD: Multimodal Context-Aware Audio Description Generation For Soccer

Chaudhary, Lipisha, Mittal, Trisha, Gopalakrishnan, Subhadra, Nwogu, Ifeoma, Pytlarz, Jaclyn

arXiv.org Artificial Intelligence

Abstract--Audio Descriptions (AD) are essential for making visual content accessible to individuals with visual impairments. Recent works have shown a promising step towards automating AD, but they have been limited to describing high-quality movie content using human-annotated ground truth AD in the process. In this work, we present an end-to-end pipeline, MCAD, that extends AD generation beyond movies to the domain of sports, with a focus on soccer games, without relying on ground truth AD. T o address the absence of domain-specific AD datasets, we fine-tune a Video Large Language Model on publicly available movie AD datasets so that it learns the narrative structure and conventions of AD. During inference, MCAD incorporates multimodal contextual cues such as player identities, soccer events/actions, and commentary from the game. These cues, combined with input prompts to the fine-tuned Video-LLM, allow the system to produce complete AD text for each video segment. We further introduce a new evaluation metric, ARGE-AD, designed to accurately assess the quality of generated AD. ARGE-AD evaluates the generated AD for the presence of five characteristics: (i) usage of people's names, (ii) mention of actions/events, (iii) appropriate length of AD, (iv) absence of pronouns, and (v) overlap from commentary/subtitles. We present an in-depth analysis of our approach on both movie and soccer datasets. We also validate the use of this metric to quantitatively comment on the quality of generated AD using our metric across domains. Additionally, we contribute audio descriptions for 100 soccer game clips annotated by two AD experts. Audio Description (AD) is the descriptive spoken narration of visual content, primarily for assisting visual impairments in accessing visual content [1].


Audio Description Generation in the Era of LLMs and VLMs: A Review of Transferable Generative AI Technologies

Gao, Yingqiang, Fischer, Lukas, Lintner, Alexa, Ebling, Sarah

arXiv.org Artificial Intelligence

Audio descriptions (ADs) function as acoustic commentaries designed to assist blind persons and persons with visual impairments in accessing digital media content on television and in movies, among other settings. As an accessibility service typically provided by trained AD professionals, the generation of ADs demands significant human effort, making the process both time-consuming and costly. Recent advancements in natural language processing (NLP) and computer vision (CV), particularly in large language models (LLMs) and vision-language models (VLMs), have allowed for getting a step closer to automatic AD generation. This paper reviews the technologies pertinent to AD generation in the era of LLMs and VLMs: we discuss how state-of-the-art NLP and CV technologies can be applied to generate ADs and identify essential research directions for the future.


MM-Narrator: Narrating Long-form Videos with Multimodal In-Context Learning

Zhang, Chaoyi, Lin, Kevin, Yang, Zhengyuan, Wang, Jianfeng, Li, Linjie, Lin, Chung-Ching, Liu, Zicheng, Wang, Lijuan

arXiv.org Artificial Intelligence

We present MM-Narrator, a novel system leveraging GPT-4 with multimodal in-context learning for the generation of audio descriptions (AD). Unlike previous methods that primarily focused on downstream fine-tuning with short video clips, MM-Narrator excels in generating precise audio descriptions for videos of extensive lengths, even beyond hours, in an autoregressive manner. This capability is made possible by the proposed memory-augmented generation process, which effectively utilizes both the short-term textual context and long-term visual memory through an efficient register-and-recall mechanism. These contextual memories compile pertinent past information, including storylines and character identities, ensuring an accurate tracking and depicting of story-coherent and character-centric audio descriptions. Maintaining the training-free design of MM-Narrator, we further propose a complexity-based demonstration selection strategy to largely enhance its multi-step reasoning capability via few-shot multimodal in-context learning (MM-ICL). Experimental results on MAD-eval dataset demonstrate that MM-Narrator consistently outperforms both the existing fine-tuning-based approaches and LLM-based approaches in most scenarios, as measured by standard evaluation metrics. Additionally, we introduce the first segment-based evaluator for recurrent text generation. Empowered by GPT-4, this evaluator comprehensively reasons and marks AD generation performance in various extendable dimensions.


Master AI Tools List

#artificialintelligence

The Master AI Tool List was created to share a comprehensive list of sites related to Artificial Intelligence and Machine Learning. We want to make this the source for discovering and sharing the latest AI tools. If you know of an AI Tool that is not currently listed, Use the submission form to request a review. Together we can make this an important resource for everyone. Ai Sofiya is a super Ai tool that can create social media ads in under a minute.